The Best Kept Secrets with Corpus Linguistics
Authors
Abstract
This paper presents the use of corpus linguistics techniques on supposedly “clean” corpora and identifies potential pitfalls. Our work relates to the task of filtering sensitive content, in which data security is strategically important for the protection of government and military information and of growing importance in combating identity fraud. A naïve keyword-filtering approach produces a large proportion of false positives, and the need for more fine-grained approaches suggests considering corpus linguistics for such content filtering. We present work undertaken on the Enron corpus, a collection of emails on which other researchers have carried out various tasks using various portions and versions. We have made some effort to reconcile the differences between these versions by considering what impact they have on our results. Our anticipated efforts to use automatic ontology learning [Gillam, Tariq and Ahmad 2005; Gillam and Ahmad 2005] and local grammars [Ahmad, Gillam and Cheng 2006] to discover points of interest within the Enron corpus have been reoriented to the problem of discovering “confidentiality banners” with a view to removing them from the collection. Our work to date makes strong use of collocation patterns [Smadja 1993] to identify the signatures of these banners, and uses the banners themselves to better understand the different versions of the Enron corpus. We further consider the use of extended collocation patterns to identify text “zones”, following [Teufel and Moens 2000], and the subsequent potential for sentiment analysis [Klimt and Yang 2004; Pang, Lee and Vaithyanathan 2002].
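As a rough illustration of how positional collocation counts can expose a formulaic banner, the sketch below tallies the words that occur at each relative offset around a node word. It is not the authors' implementation: the function name `positional_collocates`, the window size, the frequency threshold, and the toy messages are all assumptions made for this example, and the messages are invented rather than taken from the Enron corpus.

```python
from collections import Counter


def positional_collocates(tokens, node, window=5):
    """Count (word, offset) pairs around each occurrence of `node`.

    `offset` is the position of the word relative to the node word,
    restricted to the range [-window, window] and excluding 0.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= len(tokens):
                continue
            counts[(tokens[j], offset)] += 1
    return counts


# Toy messages, invented for illustration only.
messages = [
    "this email is confidential and intended solely for the addressee",
    "this email is confidential and may be legally privileged",
    "please keep the budget figures confidential until friday",
]

counts = Counter()
for msg in messages:
    counts.update(positional_collocates(msg.split(), "confidential"))

# Pairs that recur at the same offset hint at fixed, banner-like wording.
# The threshold of 2 is an arbitrary choice for this tiny sample.
frequent = {pair: n for pair, n in counts.items() if n >= 2}
print(frequent)
```

In this toy sample the first two messages share the wording "this email is confidential and", so those words recur at fixed offsets, while the third message uses "confidential" in ordinary prose and contributes no repeated pair. A plain keyword filter on "confidential" would flag all three messages alike.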
Similar resources
Associations among solicitation, relationship quality, and adolescents' disclosure and secrecy with mothers and best friends.
Disclosure and secrecy with mothers and best friends about personal, bad behavior, and multifaceted (e.g., staying out late) activities were examined using daily diaries among 102 ethnically diverse, urban middle adolescents (M = 15.18 years, SD = .89). Adolescents disclosed more and kept fewer secrets from best friends than from mothers and more frequently disclosed and kept secrets about thei...
Do We Need Discipline-Specific Academic Word Lists? Linguistics Academic Word List (LAWL)
This corpus-based study aimed at exploring the most frequently used academic words in linguistics and comparing the word list with the distribution of high-frequency words in Coxhead’s Academic Word List (AWL) and West’s General Service List (GSL) to examine their coverage within the linguistics corpus. To this end, a corpus of 700 linguistics research articles (LRAC), consisting of approximately ...
A Corpus-Based Study of the Lexical Make-up of Applied Linguistics Article Abstracts
This paper reports results from a corpus-based study that explored the frequency of words in the abstracts of applied linguistics journal articles. The abstracts of major articles in leading applied linguistics journals, published since 2005 up to November 2001, were analyzed using software modules from the Compleat Lexical Tutor. The output includes a list of the most frequent content words, list...
Secrets from friends and parents: longitudinal links with depression and antisocial behavior.
Keeping secrets from parents is associated with depression and antisocial behavior. The current study tested whether keeping secrets from best friends is similarly linked to maladjustment, and whether associations between secrecy and maladjustment are moderated by the quality of the friendship. Adolescents (N = 181; 51% female, 48% white, non-Hispanic, 45% African American) reported their secre...
Concordance-Based Data-Driven Learning Activities and Learning English Phrasal Verbs in EFL Classrooms
In spite of the highly beneficial applications of corpus linguistics in language pedagogy, it has not found its way into mainstream EFL. The major reasons seem to be the teachers’ lack of training and the unavailability of resources, especially computers in language classes. Phrasal verbs have been shown to be a problematic area of learning English as a foreign language due to their semantic op...